Certified Associate Developer for Apache Spark Best Training Material and Practice Test Q&A from CertLibrary.com

Question 1

Which of the following code blocks will most quickly return an approximation for the number of distinct values in column division in DataFrame storesDF?

A. storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct"))
B. storesDF.agg(approx_count_distinct(col("division"), 0.01).alias("divisionDistinct"))
C. storesDF.agg(approx_count_distinct(col("division"), 0.15).alias("divisionDistinct"))
D. storesDF.agg(approx_count_distinct(col("division"), 0.0).alias("divisionDistinct"))
E. storesDF.agg(approx_count_distinct(col("division"), 0.05).alias("divisionDistinct"))

Answer : C

Question 2

The code block shown below contains an error. The code block is intended to return a new DataFrame with the mean of column sqft from DataFrame storesDF in column sqftMean. Identify the error.
Code block:
storesDF.agg(mean("sqft").alias("sqftMean"))

A. The argument to the mean() operation should be a Column abject rather than a string column name.
B. The argument to the mean() operation should not be quoted.
C. The mean() operation is not a standalone function – it’s a method of the Column object.
D. The agg() operation is not appropriate here – the withColumn() operation should be used instead.
E. The only way to compute a mean of a column is with the mean() method from a DataFrame.

Answer : A

Question 3

Which of the following operations can be used to return the number of rows in a DataFrame?

A. DataFrame.numberOfRows()
B. DataFrame.n()
C. DataFrame.sum()
D. DataFrame.count()
E. DataFrame.countDistinct()

Answer : D

Question 4

Which of the following operations returns a GroupedData object?

A. DataFrame.GroupBy()
B. DataFrame.cubed()
C. DataFrame.group()
D. DataFrame.groupBy()
E. DataFrame.grouping_id()

Answer : D

Question 5

Which of the following code blocks returns a collection of summary statistics for all columns in
DataFrame storesDF?

A. storesDF.summary("mean")
B. storesDF.describe(all = True)
C. storesDF.describe("all")
D. storesDF.summary("all")
E. storesDF.describe()

Answer : E

Question 6

Which of the following code blocks fails to return a DataFrame reverse sorted alphabetically based on column division?

A. storesDF.orderBy("division", ascending – False)
B. storesDF.orderBy(["division"], ascending = [0])
C. storesDF.orderBy(col("division").asc())
D. storesDF.sort("division", ascending – False)
E. storesDF.sort(desc("division"))

Answer : C

Question 7

Which of the following code blocks returns a 15 percent sample of rows from DataFrame storesDF without replacement?

A. storesDF.sample(fraction = 0.10)
B. storesDF.sampleBy(fraction = 0.15)
C. storesDF.sample(True, fraction = 0.10)
D. storesDF.sample()
E. storesDF.sample(fraction = 0.15)

Answer : E

Question 8

Which of the following code blocks returns all the rows from DataFrame storesDF?

A. storesDF.head()
B. storesDF.collect()
C. storesDF.count()
D. storesDF.take()
E. storesDF.show()

Answer : B

Question 9

Which of the following code blocks applies the function assessPerformance() to each row of DataFrame storesDF?

A. [assessPerformance(row) for row in storesDF.take(3)]
B. [assessPerformance() for row in storesDF]
C. storesDF.collect().apply(lambda: assessPerformance)
D. [assessPerformance(row) for row in storesDF.collect()]
E. [assessPerformance(row) for row in storesDF]

Answer : D

Question 10

The code block shown below contains an error. The code block is intended to print the schema of DataFrame storesDF. Identify the error.
Code block:
storesDF.printSchema

A. There is no printSchema member of DataFrame – schema and the print() function should be used instead.
B. The entire line needs to be a string – it should be wrapped by str().
C. There is no printSchema member of DataFrame – the getSchema() operation should be used instead.
D. There is no printSchema member of DataFrame – the schema() operation should be used instead.
E. The printSchema member of DataFrame is an operation and needs to be followed by parentheses.

Answer : E

Question 11

The code block shown below should create and register a SQL UDF named "ASSESS_PERFORMANCE" using the Python function assessPerformance() and apply it to column customerSatisfaction in table stores. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
spark._1_._2_(_3_, _4_)
spark.sql("SELECT customerSatisfaction, _5_(customerSatisfaction) AS result FROM stores")

A. 1. udf
2. register
3. "ASSESS_PERFORMANCE"
4. assessPerformance
5. ASSESS_PERFORMANCE
B. 1. udf
2. register
3. assessPerformance
4. "ASSESS_PERFORMANCE"
5. "ASSESS_PERFORMANCE"
C. 1. udf
2. register
3."ASSESS_PERFORMANCE"
4. assessPerformance
5. "ASSESS_PERFORMANCE"
D. 1. register
2. udf
3. "ASSESS_PERFORMANCE"
4. assessPerformance
5. "ASSESS_PERFORMANCE"
E. 1. udf
2. register
3. ASSESS_PERFORMANCE
4. assessPerformance
5. ASSESS_PERFORMANCE

Answer : A

Question 12

The code block shown below contains an error. The code block is intended to create a Python UDF assessPerformanceUDF() using the integer-returning Python function assessPerformance() and apply it to column customerSatisfaction in DataFrame storesDF. Identify the error.
Code block:
assessPerformanceUDF – udf(assessPerformance)
storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))

A. The assessPerformance() operation is not properly registered as a UDF.
B. The withColumn() operation is not appropriate here – UDFs should be applied by iterating over rows instead.
C. UDFs can only be applied vie SQL and not through the DataFrame API.
D. The return type of the assessPerformanceUDF() is not specified in the udf() operation.
E. The assessPerformance() operation should be used on column customerSatisfaction rather than the assessPerformanceUDF() operation.

Answer : D

Question 13

The code block shown below contains an error. The code block is intended to use SQL to return a new DataFrame containing column storeId and column managerName from a table created from DataFrame storesDF. Identify the error.
Code block:
storesDF.createOrReplaceTempView("stores")
storesDF.sql("SELECT storeId, managerName FROM stores")

A. The createOrReplaceTempView() operation does not make a Dataframe accessible via SQL.
B. The sql() operation should be accessed via the spark variable rather than DataFrame storesDF.
C. There is the sql() operation in DataFrame storesDF. The operation query() should be used instead.
D. This cannot be accomplished using SQL – the DataFrame API should be used instead.
E. The createOrReplaceTempView() operation should be accessed via the spark variable rather than DataFrame storesDF.

Answer : B

Question 14

The code block shown below should create a single-column DataFrame from Python list years which is made up of integers. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
_1_._2_(_3_, _4_)

A. 1. spark
2. createDataFrame
3. years
4. IntegerType
B. 1. DataFrame
2. create
3. [years]
4. IntegerType
C. 1. spark
2. createDataFrame
3. [years]
4. IntegertType
D. 1. spark
2. createDataFrame
3. [years]
4. IntegertType()
E. 1. spark
2. createDataFrame
3. years
4. IntegertType()

Answer : E

Question 15

The code block shown below contains an error. The code block is intended to cache DataFrame storesDF only in Spark’s memory and then return the number of rows in the cached DataFrame. Identify the error.
Code block:
storesDF.cache().count()

A. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be specified to MEMORY_ONLY as an argument to cache().
B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().
C. The storesDF DataFrame has not been checkpointed – it must have a checkpoint in order to be cached.
D. DataFrames themselves cannot be cached – DataFrame storesDF must be cached as a table.
E. The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – persist() should be used instead.

Answer : B

Certified Associate Developer for Apache Spark v1.0

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

Question 10

Question 11

Question 12

Question 13

Question 14

Question 15

Talk to us!